To reduce Barcelona’s accident rate, it is informative to learn where and when accidents occur most often. It is also useful to explore the correlation between accidents and other factors, including transportation types and the unemployment rate. By analyzing location, time, and other factors correlated with accidents, this project suggests possible ways to reduce Barcelona’s accident rate.
The link to our dataset: https://www.kaggle.com/xvivancos/barcelona-data-sets#unemployment.csv
This project explores the determinants of traffic accidents in Barcelona. In particular, we want to study whether location, transportation type, time, and the unemployment rate are correlated with traffic accidents.
The three datasets were combined using Python. The transportation dataset was cleaned and each transportation type was turned into a dummy feature. For the unemployment dataset, we summed the registered unemployed individuals over one year for each district in Barcelona, then merged this figure into the accident dataset on the District Name key. Since the dataset contains a large number of samples, we dropped all rows with NA values. Finally, we converted district name, weekday, and month into dummy variables as well.
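The merging steps above can be sketched with pandas as follows. This is a minimal illustration: the toy frames and the column names `Number` and `Weekday` are placeholders, not the exact Kaggle column names.

```python
import pandas as pd

# Toy frames standing in for the Kaggle CSVs (column names are assumptions)
accidents = pd.DataFrame({
    'District Name': ['Eixample', 'Gràcia', 'Eixample'],
    'Weekday': ['Monday', 'Friday', 'Monday'],
})
unemployment = pd.DataFrame({
    'District Name': ['Eixample', 'Eixample', 'Gràcia'],
    'Number': [100, 120, 80],
})

# Sum registered unemployment over the year per district
unemp_per_district = (unemployment.groupby('District Name')['Number']
                      .sum().rename('Unemployment').reset_index())

# Merge the per-district figure onto the accident records on District Name
merged = accidents.merge(unemp_per_district, on='District Name', how='left')

# Drop rows with missing values, then one-hot encode the categoricals
merged = merged.dropna()
merged = pd.get_dummies(merged, columns=['District Name', 'Weekday'])
```

The same `pd.get_dummies` call extends naturally to the month and transportation-type columns.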
We created a dataset with a categorical target that represents the severity of each accident, based on its mild and serious injury counts.
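One way to derive such a target is sketched below; the injury-count column names are assumptions about the Kaggle schema, and the exact labeling rule used in the project may differ.

```python
import pandas as pd

# Hypothetical injury-count columns (names are assumptions)
df = pd.DataFrame({'Mild injuries': [1, 0, 2],
                   'Serious injuries': [0, 1, 0]})

# Label an accident 'serious' if any serious injury was recorded, else 'mild'
df['Severity'] = (df['Serious injuries'] > 0).map({True: 'serious',
                                                   False: 'mild'})
```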
In this project, we focus on studying the correlations between traffic accidents and several non-human factors, including location, transportation type, time, and the unemployment rate. Therefore, we manually selected a set of features based on our research interests before running the models.
In particular, we want to study how traffic accidents behave during the summer (June–August) and at locations near common transportation types, including Airport train, Cableway, Maritime station, Railway (FGC), and Underground.
# Draw the bar plot from f_importances
h = f_importances.plot(x='Features', y='Importance', kind='bar', figsize=(16, 9), rot=80, fontsize=20)

# Label the axes and show the plot
h.set_ylabel('Importance', fontsize=18)
h.set_xlabel('Feature name', fontsize=18)
plt.tight_layout()
plt.show()
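For context, a frame like `f_importances` can be assembled from a fitted random forest's `feature_importances_` attribute. This sketch uses synthetic data as a stand-in for the engineered accident features:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the engineered accident features (assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
feature_names = [f'feature_{i}' for i in range(5)]

rf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)

# Pair each feature with its importance and sort for plotting
f_importances = (pd.DataFrame({'Features': feature_names,
                               'Importance': rf.feature_importances_})
                 .sort_values('Importance', ascending=False))
```

The importances sum to 1, so the bar heights can be read as each feature's relative share of the forest's split quality.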
# Sort best_score_param_estimators in descending order of best_score_
best_score_param_estimators = sorted(best_score_param_estimators, key=lambda x: x[0], reverse=True)

# For each [best_score_, best_params_, best_estimator_]
for best_score_param_estimator in best_score_param_estimators:
    # Print out [best_score_, best_params_, best_estimator_], where best_estimator_ is a pipeline;
    # for readability we only print the type of the pipeline's classifier
    print([best_score_param_estimator[0], best_score_param_estimator[1], type(best_score_param_estimator[2].named_steps['clf'])], end='\n\n')
[0.9024322446143155, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2, 'clf__n_estimators': 30}, <class 'sklearn.ensemble.forest.RandomForestClassifier'>]
[0.9000463284688441, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2}, <class 'sklearn.tree.tree.DecisionTreeClassifier'>]
[0.6218438730599953, {'clf__n_neighbors': 9}, <class 'sklearn.neighbors.classification.KNeighborsClassifier'>]
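A list like `best_score_param_estimators` can be produced by grid-searching one pipeline per model and collecting each search's best score, parameters, and estimator. The sketch below uses synthetic data and small grids, not the project's exact setup:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the accident feature matrix (assumption)
X, y = make_classification(n_samples=200, random_state=0)

# One (classifier, parameter grid) pair per model family
models = [
    (DecisionTreeClassifier(random_state=0),
     {'clf__min_samples_leaf': [1, 5], 'clf__min_samples_split': [2, 10]}),
    (KNeighborsClassifier(), {'clf__n_neighbors': [3, 9]}),
]

best_score_param_estimators = []
for clf, param_grid in models:
    # Each pipeline names its classifier step 'clf', matching the
    # 'clf__...' parameter keys in the printed results above
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', clf)])
    gs = GridSearchCV(pipe, param_grid, cv=3)
    gs.fit(X, y)
    best_score_param_estimators.append(
        (gs.best_score_, gs.best_params_, gs.best_estimator_))
```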
# Export the fitted decision tree and render it as a PNG; the pipeline
# step is named 'clf', matching the 'clf__...' parameter keys above
dot_data = export_graphviz(pipe_dt.named_steps['clf'],
                           filled=True,
                           rounded=True,
                           feature_names=feature_value_names,
                           out_file=None)
graph = graph_from_dot_data(dot_data)
Image(graph.create_png())